Add LLM analyzers #508
base: master
Conversation
Outstanding to-do items:
# The data summary is un-informative and verbose
if DataSummary in model.attributes:
    del model.attributes[DataSummary]
return serialize(make_serializable(model.attributes))
Isn't this essentially a re-implementation of what the PJSONSerializationService does? Is there a reason to not use it?
PJSON has types alongside the values. This is useful for deserializing PJSON values, but would make this code overly complicated, since we never need to deserialize the resulting text and thus would end up adding logic to drop the types anyway.
Moreover, PJSON is nice as a wire format, but is not optimized for human-readable text "serialization." We want to deviate from its representation of some values to make the resulting text easier for the LLM to understand.
I also (personally) just don't like the PJSON serialization service.
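To illustrate the distinction, here is a minimal sketch of the flattening idea. The helper names mirror the ones in the diff, but ExampleAttributes and the exact output shapes are illustrative assumptions, not OFRAK's real classes or PJSON's actual wire format:

```python
import json
from dataclasses import asdict, dataclass, is_dataclass


@dataclass
class ExampleAttributes:
    isa: str = "ARM"
    bit_width: int = 32


def make_serializable(attributes: dict) -> dict:
    # Keep only plain key/value pairs, keyed by the attribute class name,
    # so the text handed to the LLM carries no serialization metadata.
    return {
        type(value).__name__: asdict(value) if is_dataclass(value) else value
        for value in attributes.values()
    }


def serialize(obj: dict) -> str:
    # Render the flattened attributes as human-readable text for the prompt.
    return json.dumps(obj, indent=2)


print(serialize(make_serializable({ExampleAttributes: ExampleAttributes()})))
# Prints:
# {
#   "ExampleAttributes": {
#     "isa": "ARM",
#     "bit_width": 32
#   }
# }
```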
Are we sure that the serialized PJSON would hinder the LLM analysis? This sounds like a hypothesis and not something that was tested.
We should, whenever possible, avoid writing ~80 lines of code if it can be avoided.
I could surely get something working with PJSON. The (admittedly untested) hypothesis is that the code to do so would be as long as this code, less clear than this code, and dependent on undocumented serialized PJSON structure, which may change in the future, necessitating changes here.
If you still think I should rewrite this to use PJSON, I will.
See comment above.
# Install Ollama
RUN curl -L "https://ollama.com/download/ollama-linux-""$TARGETARCH"".tgz" | tar -C /usr/ -xzv
Are we sure we want to add this to the Docker image?
This feels like an incomplete install step -- we're adding a dependency that requires the user to download/install more things to actually use it. The impact is a larger image size for all users, plus more work for users who actually want to use the LLM feature.
How would you propose doing it instead?
It would be ideal to have this whole thing in a separate package so that users could conditionally include the AI components. Then we could install Ollama and pull a model without worrying about unnecessarily inflating the core image. But one of the original constraints was that this should be in OFRAK core.
If we don't have Ollama installed, then we can't run the tests without having the tests themselves pull down and install the binary. As far as I know, in other OFRAK tests, there is no precedent for installing a dependency like this. It would be doable, but seems a little weird.
On the other hand, we could pull a model in the OFRAK core Dockerstub after installing Ollama, but this seems like a bad idea. We don't want to add another several-hundred-MB dependency to the OFRAK core images. We especially don't want to add models this large to the base image if the user is just going to use OpenAI anyway.
Installing Ollama, but not pulling a model, seemed like a reasonable tradeoff. If it seems wrong to you, I would appreciate guidance on a better way to do it.
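One possible middle ground for the test concern, sketched here purely as an illustration (the port, marker name, and use of requests are assumptions, not part of this PR): skip the LLM tests whenever no Ollama server is reachable, so core CI stays green without a model while developers who run Ollama still exercise the tests.

```python
import pytest
import requests

OLLAMA_URL = "http://localhost:11434"  # Ollama's default local address (assumption)


def ollama_available() -> bool:
    # Return True if a local Ollama server answers on the default port.
    try:
        return requests.get(OLLAMA_URL, timeout=2).status_code == 200
    except requests.RequestException:
        return False


# Decorate the LLM analyzer tests with this marker so core CI still passes
# when Ollama is not installed, while developers who have it run the tests.
requires_ollama = pytest.mark.skipif(
    not ollama_available(), reason="Ollama server is not running"
)
```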
ofrak_injector.discover(ofrak_ghidra)


async def test_llm_component(ofrak_context: OFRAKContext, model: str):
Can we use the https://github.com/redballoonsecurity/ofrak/blob/master/ofrak_core/test_ofrak/unit/component/analyzer/analyzer_test_case.py#L34 to run these tests?
Right now these tests are not validating that the analyzer successfully generates the needed attributes.
I can do that validation in these tests. It does not make sense to use an AnalyzerTests subclass here since the analyzer must be run with resource.run. It requires a config, and thus cannot be called with resource.analyze, as happens in AnalyzerTests methods.
Also, the tests do call get_attributes at the end, so they're validating that the attributes are added.
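For reference, the run-with-config-then-validate pattern described above might look roughly like this; LlmAnalyzer, LlmAnalyzerConfig, and LlmAttributes are placeholder names for illustration, not the actual identifiers in this PR:

```python
from ofrak import OFRAKContext


async def test_llm_component(ofrak_context: OFRAKContext, model: str):
    resource = await ofrak_context.create_root_resource("test_llm", b"\x00" * 64)

    # The analyzer requires a config, so it is invoked with resource.run(...)
    # rather than resource.analyze(...); this is why the AnalyzerTests base
    # class does not fit. LlmAnalyzer / LlmAnalyzerConfig / LlmAttributes are
    # placeholders for the real component, config, and attributes classes.
    await resource.run(LlmAnalyzer, LlmAnalyzerConfig(model=model))

    # Validate that the analyzer actually attached the expected attributes.
    attributes = resource.get_attributes(LlmAttributes)
    assert attributes is not None
```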
One sentence summary of this PR (This should go in the CHANGELOG!)
Add AI-enhanced LLM analyzers.
Anyone you think should look at this, specifically?
@whyitfor